Discovering Data and Information Quality Research Insights Gained through Latent Semantic Analysis

Authors

  • Roger Blake
  • Ganesan Shankaranarayanan
Abstract

Over the past decade, the field of data and information quality (DQ) has grown into a research area that spans multiple disciplines. The motivation here is to help understand the core topics and themes that constitute this area and to determine how those topics and themes from DQ relate to business intelligence (BI). To do so, the authors present the results of a study which mines the abstracts of articles in DQ published over the last decade. Using Latent Semantic Analysis (LSA), six core themes of DQ research are identified, as well as twelve dominant topics comprising them. Five of these topics (decision support, database design and data mining, data querying and cleansing, data integration, and DQ for analytics) all relate to BI, emphasizing the importance of research that combines DQ with BI. The DQ topics from these results are profiled with BI, and used to suggest several opportunities for researchers.

DOI: 10.4018/jbir.2012010101. International Journal of Business Intelligence Research, 3(1), 1-16, January-March 2012. Copyright © 2012, IGI Global.

[...]ity”, consistent with earlier research (Ballou & Pazer, 1985; Pipino, Lee, & Wang, 2002). Given the extensive growth of DQ research and its likely continued growth in the future, it is important for researchers to understand the key research themes in DQ research and the popular research topics within each theme. DQ clearly has an enormous impact on the effectiveness of business intelligence (BI), and concurrent with DQ research, research on business intelligence has also grown significantly. The capabilities BI offers organizations have sharply increased, and fact-based decision-making has become not just the norm for many companies but is considered critical for success.
As BI assumes ever more significance, so too does the need for a conceptual understanding of this field, as has been investigated by researchers (Foley & Guillemette, 2010). Since DQ is fundamental to BI, it is important to understand how the topics and themes of DQ research have evolved over time. It is also important to target BI areas related to DQ that have not been addressed in the literature, and to identify the BI topic(s) within DQ that can garner the attention of practitioners and academics, both in BI and in DQ. Concepts analogous to those found in DQ research can be found in BI, but these are often constructed and represented differently. These concepts may overlap but are difficult to associate. For instance, accuracy and completeness are well known to DQ researchers as two important DQ dimensions, each defined differently and distinctly from the other. Data mining research has investigated the same phenomena, but has generally considered both as forms of “noise”. Some data mining studies have defined noise in a manner that is very similar to the definition of accuracy in DQ research. Other data mining studies examining noise have used definitions similar to those for completeness in DQ research (Blake & Mangiameli, 2011). Finding the core concepts in DQ research and how they relate to BI, in order to point to opportunities for researchers, is an important motivation for this study. Although they have not been explicitly connected to BI, there have been many attempts to define the core concepts of DQ research, which have proposed frameworks to summarize and/or classify this area (Ge & Helfert, 2007; Lima, Maçada, & Vargas, 2006; Madnick, Wang, Yang, & Zhu, 2009; Neely & Cook, 2008). Each examines the literature and defines its classification framework from the respective researchers’ point of view.
Although they offer invaluable insights into DQ research, we posit that there is a more interesting point of view that comes not from the researchers but from the research itself. What if the body of literature can inform us about the core themes and, in addition, associate the dominant core topics with each theme? What if we could examine relationships between those topics and themes and how they relate to business intelligence? The existing literature does not answer these questions. We believe that our methodology can answer these questions and more. Further, our methodology can be replicated to define the status of this (or any) research field at any time in the future. Understanding the core topics and themes of an area of research is an important part of “establishing the identity” of that research area. Sidorova, Evangelopoulos, Valacich, and Ramakrishnan (2008) sought to do the same for the identity of the IS discipline, which they defined with five core areas. Our ambition in pursuing this research is similar to that of Sidorova et al.: to help define the identity of DQ research by understanding its core topics and themes. The objectives of this paper are to take an important step towards defining that identity through its topics and themes, to use the results of the analysis to guide researchers in this area, and to develop a reproducible method that can be used for similar purposes in the future. This paper makes the following contributions: (1) It identifies, from the current body of literature, the key themes in DQ research. (2) It further defines the key core topics within each theme. We strongly believe this offers a superior representation of the current state of research in this area. (3) It derives a mapping between research topics and the dimensions of DQ, which are central to much DQ research.
This sheds light on how DQ research themes [...]
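
The abstract describes applying LSA to article abstracts to surface themes and topics. The excerpt does not give the study's term weighting, number of factors, or any rotation step, so the following is only a minimal sketch of generic LSA under stated assumptions: a TF-IDF term-document matrix followed by a truncated singular value decomposition. The toy corpus `DOCS` and the helper names `tfidf_matrix`, `lsa`, and `top_terms` are hypothetical and invented for illustration; they are not the authors' data or code.

```python
import re
import numpy as np

# Hypothetical toy abstracts, invented for illustration (not the study's corpus).
DOCS = [
    "data quality assessment for decision support",
    "decision support depends on data accuracy",
    "record linkage and data cleansing for integration",
    "data integration and cleansing across databases",
]

def tfidf_matrix(docs):
    """Return (terms, matrix) with one row per term and one column per document."""
    tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    terms = sorted({t for doc in tokens for t in doc})
    index = {t: i for i, t in enumerate(terms)}
    counts = np.zeros((len(terms), len(docs)))
    for j, doc in enumerate(tokens):
        for t in doc:
            counts[index[t], j] += 1
    df = (counts > 0).sum(axis=1)        # document frequency of each term
    idf = np.log(len(docs) / df) + 1.0   # +1 keeps terms found in every document from being zeroed
    return terms, counts * idf[:, None]

def lsa(docs, k=2):
    """Truncated SVD of the TF-IDF matrix, keeping the k largest factors."""
    terms, X = tfidf_matrix(docs)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return terms, U[:, :k], s[:k], Vt[:k, :]

def top_terms(terms, U, factor, n=3):
    """Terms with the largest absolute loading on one latent factor."""
    order = np.argsort(-np.abs(U[:, factor]))[:n]
    return [terms[i] for i in order]

terms, U, s, Vt = lsa(DOCS, k=2)
for f in range(2):
    print(f, top_terms(terms, U, f))
```

In a sketch like this, the high-loading terms of each retained factor are what an analyst would read and label as a theme or topic; choosing `k` and interpreting the factors are judgment calls that the excerpt above does not specify.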

Similar resources

Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics

This paper discusses the future of World Wide Web development, called the Semantic Web. Undoubtedly, Web service is one of the most important services on the Internet, which has had the greatest impact on the generalization of the Internet in human societies. Internet penetration has been an effective factor in the growth of the volume of information on the Web. The massive growth of informat...

Query expansion based on relevance feedback and latent semantic analysis

Web search engines are one of the most popular tools on the Internet which are widely-used by expert and novice users. Constructing an adequate query which represents the best specification of users’ information need to the search engine is an important concern of web users. Query expansion is a way to reduce this concern and increase user satisfaction. In this paper, a new method of query expa...

Discovering Structure in Design Databases Through Functional and Surface Based Mapping

This work presents a methodology for discovering structure in design repository databases, toward the ultimate goal of stimulating designers through design-by-analogy. Using a Bayesian model combined with latent semantic analysis (LSA) for discovering structural form in data, an exploration of inherent structural forms, based on the content and similarity of design data, is undertaken to gain u...

Discovering task-oriented usage pattern for web recommendation

Web transaction data usually convey users' task-oriented behaviour patterns. Web usage mining techniques are able to capture such informative knowledge about user task patterns from usage data. With the discovered usage pattern information, it is possible to recommend more preferred content or customized presentation to Web users according to the derived task preference. In this paper, we propose a Web r...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation

Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems primarily rely on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological e...

Journal title:
  • IJBIR

Volume 3, Issue 1

Pages 1-16

Publication date: 2012